In a Role-Playing Game, finding optimal trajectories is one of the most important tasks. In fact, the strategy decision system
becomes a key component of a game engine. The way in which decisions are taken (online, batch or simulated) and
the resources consumed in decision making (e.g. execution time, memory) influence game performance to a major degree.
When classical search algorithms such as A* can be used, they are the very first option. Nevertheless, such methods rely on precise
and complete models of the search space, and there are many interesting scenarios where their application is not possible. In those cases,
model-free methods for sequential decision making under uncertainty are the best choice. In this paper, we propose a heuristic
planning strategy to incorporate the ability of heuristic search in path-finding into a Dyna agent. The proposed Dyna-H algorithm,
as A* does, selects branches more likely to produce successful outcomes than other branches. Besides, it has the advantages of being a
model-free online reinforcement learning algorithm. The proposal was evaluated against the one-step Q-Learning and Dyna-Q algorithms,
obtaining excellent experimental results: Dyna-H significantly outperforms both methods in all experiments. We also suggest a
functional analogy between the proposed heuristic of sampling from the worst trajectories and the role of dreams (e.g. nightmares) in
human behavior.
Decision support systems (DSS) are computer-based information
systems that support business or any other organizational
decision-making activities. DSSs help to make decisions,
which may be rapidly changing and not easily specified in advance.
DSSs include knowledge-based systems. The importance
of making a good decision in any business is evident. In
a dynamic environment, decision processes not only need to be
well designed but must also adapt rapidly to changes in the environment.
Existing work on decision making has centered on
the concepts of rational and boundedly rational decision processes.
Recent works include a third model of decision, based
on the use of heuristics.

In recent years, there has been an increasing interest in the
issues of cost-sensitive learning and decision making, in a variety
of applications, in order to maximize the total benefits over
time. A number of approaches have been developed that are
effective at optimizing cost-sensitive decisions (Lopez et al.,
2010; Iglesias et al., 2008), some of them based on a synergy
between different intelligent techniques and other fields that together
comprise what is called knowledge engineering (Lu and
Ruan, 2007).
In any decision making strategy, an agent seeks to achieve
a goal, despite uncertainty about its environment. The agent's
actions influence the future state of the environment, thereby affecting
the options and possible alternatives at later times. Correct
choice requires taking into account indirect, delayed consequences
of actions, and thus may include foresight or planning
(Sutton and Barto, 1998).

Among all the decisions involved in computer games, the
most common is probably path-finding, i.e., looking for a good
route or path for moving an entity from here to there. The entity
can be a single person, a vehicle, or a combat unit; the genre
can be an action game, a simulator, a role-playing game, or a
strategy game. The main focus of this research is to compute
collision-free shortest paths as quickly as possible. Although
path-finding is not trivial, there are some well-established, solid
algorithms that have been applied, some of them more efficient
than others (Bayili and Polat, 2011; Alvarez et al., 2010).

In this paper we use, as the case study, the Role-Playing
Games (RPG) scenario, where the player selects a target point
(t) from its current position and the entity (e) is automatically
taken to t without interacting with the system, avoiding obstacles
and optimizing the trajectory. This automatic process
can be carried out by different approaches (Karamouzas and
Overmars, 2008). Most of the searching strategies proposed in
the literature are included in the wide area of machine learning
(Alpaydin, 2004; Mitchell, 1997). When classical search
algorithms such as A* can be used, they are the very first choice
for computing optimal solutions. Nevertheless, these methods
can be computationally demanding, especially for very large
environments. For instance, A* based algorithms usually demand
quite high execution time, since the decisions rely on an
exhaustive planning strategy. Moreover, such methods depend
heavily on precise and complete models of the environment,
e.g. the game arena. So, there are many interesting scenarios
where they cannot be applied. Therefore, model-free methods
for sequential decision making problems under uncertainty are
well suited to these cases, since the incremental nature of their
learning mechanisms and the direct action selection mechanism
of their decision making procedures make it possible to use them
in real-time applications.

Many other applications of these learning strategies can be
found in the literature. Without being exhaustive, some recent
paradigmatic examples can be cited. In the airline ticket purchasing
problem (Gilmore, 2008), the author uses different techniques
to acquire a flight ticket at the lowest cost. MALADY is a
Machine Learning-Based Autonomous Decision-Making System
for Sensor Networks (Krishnamurthy et al., 2009), in which
sensor networks are able to learn and make decisions in real
time. Muse et al. (2006) present a system for visual robotic
docking using an omnidirectional camera coupled with the
actor-critic reinforcement learning algorithm. In this case, a network
trained via reinforcement allows the robot to turn to and
approach a table to pick up an object. Janssens et al. (2007) present
an application of reinforcement learning (Q-learning) that simulates
time and location information for a given sequence of
travel activities. Even in a different field such as education we
can find some interesting applications (Iglesias et al., 2009). In
that paper, the process of learning pedagogical policies according
to the students' needs is cast as an RL problem. Kaelbling et al.
(1996) and Busoniu et al. (2008) have written surveys on reinforcement
learning and its applications. A heuristic method can
use search trees. However, instead of generating all possible
solution branches, a heuristic selects branches more likely
to produce successful outcomes than other branches. It is selective
at each decision point. This paper is an extension of a
previous one on path-finding for RPGs (Alvarez et al., 2010).

In this article, we introduce a novel algorithm that includes
a heuristic planning module (sampling from the worst trajectories)
and a function H (the a priori knowledge injected into
the system) that can contain any kind of information expressing
how bad it is to take an action in a particular situation, for
example, the Euclidean distance between a goal state and the
current state. The proposed Dyna-H algorithm is based on the
well-known Dyna architecture (Sutton, 1991; Sutton and Barto,
1998).

Grid-world-like environments treated as Markov sequential
decision problems are used nowadays in many research
works to evaluate and show the behavior of standard algorithms
against newly proposed ones. The results obtained in these test
cases are easily generalizable to other problems, such as robot
navigation, and, in general, to any sequential decision problem;
in this particular case, to an informed (i.e. knowledge-based)
sequential decision problem. The proposed method, the one-step
Q-learning and the Dyna-Q algorithms have been applied to
the same problem and compared in terms of learning rate.



The structure of the paper is as follows. In Section 2, the
strategies that are going to be applied and compared are briefly
described, and the novel proposal is introduced in Section 3. The
experimental scenario is described in Section 4. Results obtained by the
different algorithms are discussed in Section 5. The last section
(6) is devoted to the conclusions and further work.

2. Search, Reinforcement learning and Planning 

The algorithms that are going to be compared are briefly described
in this section. A new algorithm based on the Dyna
architecture (Sutton, 1991; Sutton and Barto, 1998), which combines
heuristic on-line search and Q-Learning, is presented.
We focus on solving path planning problems for homogeneous
agents in homogeneous environments.

2.1. Heuristic search, the A* algorithm 

The predominant state-space planning methods in artificial
intelligence are collectively known as heuristic search. Unlike
other planning methods, heuristic search is not concerned with
changing the approximate value function, but only with improving
the action selection given the current value function.

In heuristic search, for each state encountered, a large tree 
of possible alternatives is considered. The approximate value 
function is applied to the leaf nodes, and then backed up at the 
previous state towards the root. The backing up in the search 
tree is just the same as in the max-backups. This backing up 
stops at the state-action nodes of the current state. Once the 
backed-up values of these nodes are computed, the best of them 
is chosen as the current action, and the rest of the values are 
discarded. In conventional heuristic search no effort is made
to save the backed-up values and the value function, once de- 
signed, never changes as a result of the search. However, it 
would be reasonable to allow the value function to be improved 
over time, using either the backed-up values computed during 
the heuristic search or by any other method. 

Heuristic methods such as A* based algorithms have been
widely applied. Actually, in the game development community,
the most popular path-planning approach is to divide the environment
into a grid that can be explored using these A* based algorithms.
This approach works very well in computer games,
as it always retrieves the shortest path, if one exists. This heuristic
search ranks each node by an estimate of the best route through
that node. It combines the tracking of the previous path length
of Dijkstra's algorithm (Dijkstra, 1959) with the heuristic estimate
of the remaining path from best-first search. Since some
nodes may be processed more than once, in order to find better
paths later, it is necessary to keep track of them in a list.
By adding this heuristic score to the nodes stored in Dijkstra's priority
queue, the number of nodes visited during the search can
be effectively pruned down.

A* has a couple of interesting properties. It is guaranteed to 
find the shortest path, as long as the heuristic estimate is admis- 
sible. That is, it is never greater than the remaining distance to 
the goal. It makes the most efficient use of the heuristic func- 
tion: no search that uses the same heuristic function and finds 
optimal paths will expand fewer nodes than A*, not counting
tie-breaking among nodes of equal cost. A* turns out to be very 
flexible in practice. 
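
As an illustration of the node ranking just described (path length so far plus a heuristic estimate of the remainder), the following is a minimal Python sketch of A* on a 4-connected grid. The grid encoding (1 marks a wall), the unit move cost, the Manhattan-distance heuristic and the function name `a_star` are assumptions made for this example, not details taken from the paper.

```python
import heapq


def a_star(grid, start, goal):
    """Return a shortest 4-connected path from start to goal, or None.

    grid[r][c] == 1 marks a wall; cells are (row, col) tuples (assumed encoding).
    """
    def h(cell):
        # Manhattan distance: never overestimates with unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    g_cost = {start: 0}               # best known path length from start
    parent = {start: None}
    frontier = [(h(start), start)]    # priority = g + h
    closed = set()
    while frontier:
        _, cell = heapq.heappop(frontier)
        if cell == goal:
            path = []
            while cell is not None:   # walk the parent links back to start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if cell in closed:
            continue                  # stale queue entry for an expanded node
        closed.add(cell)
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue
            new_g = g_cost[cell] + 1
            if new_g < g_cost.get((nr, nc), float("inf")):
                g_cost[(nr, nc)] = new_g
                parent[(nr, nc)] = cell
                heapq.heappush(frontier, (new_g + h((nr, nc)), (nr, nc)))
    return None                       # goal unreachable
```

Note that the sketch needs the complete wall map before it starts, which is exactly the requirement that the model-free methods of the following sections avoid.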

2.2. Reinforcement Learning (RL) 



Reinforcement Learning (Kaelbling et al., 1996; Sutton and
Barto, 1998) goes back to the very first stages of Artificial Intelligence
and Machine Learning, and it has several applications in
the Intelligent Knowledge Engineering Systems domain. It
has also been successfully applied to game playing (Littman,
1994).

Under a constrained environment, the learning agent can perceive
a set S of distinct states, which are normally characterized
by a number of dimensions, and it has a set A of possible actions
at each state. Reinforcement learning tasks are generally discrete.
At each time step $t$, the agent observes the current state $s_t$
and chooses a possible action $a_t$, which leads to the succeeding
state $s_{t+1} = \delta(s_t, a_t)$. Then, the environment generates a reward
$r(s_t, a_t)$. These rewards can be positive, zero or negative and can
be delayed. In other words, some actions and their state transitions
may bring low rewards in the short term, while they will lead
to state-action pairs with a much higher reward later. On the
contrary, an action in a given state may receive an immediate
high reward, whereas it makes the agent enter into a path where
the following actions have very low or even negative rewards.
Therefore, the task of the agent is to learn a policy $\pi : S \to A$
that achieves the maximum cumulative reward over time.

Reinforcement learning agents are connected to the environ- 
ment by perceptions and actions. On each step of the interaction 
with the environment, the agent receives as input the current 
state and the value of that state. This value is the reward. The 
agent records the reward signal and updates the policy based on
the information received about the reward so far. 

Q-Learning (Watkins and Dayan, 1992) is a popular method
of model-free reinforcement learning. It can also be viewed as
a method of asynchronous dynamic programming (DP) (Bellman,
1957; Bellman and Dreyfus, 1962). Reinforcement Learning
provides agents with the capability of learning from interactions
with the environment, to act optimally in Markovian
domains by experiencing the consequences of actions, without
requiring them to receive or build maps (models) of the domains
(Grzes and Kudenko, 2010).

Learning proceeds similarly to Sutton's method of temporal
differences (TD) (Sutton and Barto, 1998): an agent tries an action
at a particular state, and evaluates its consequences in terms
of the immediate reward or penalty it receives and its estimate
of the value of the state to which it is taken. By trying all actions
in all states repeatedly, it learns which ones are the best overall,
judged by long-term discounted cumulative reward (Tesauro,
1992).

A probabilistic approach is commonly used in Q-learning.
A straightforward strategy is the ε-greedy method, where the
probability of making a random choice is controlled by the parameter
ε. In every step, with probability 1 - ε, the agent
fully exploits the information stored in the Q-values, and with
probability ε the agent chooses a random action in order to explore
the state space. In the exploration mode, the ε-greedy
method assumes equal selection probabilities for any possible
action, whereas the chance of selecting a better action may be
increased by taking into account the current value distribution among
the alternatives, as in the soft-max methods (Sutton and Barto,
1998).

Figure 1: Information flow in the Dyna architecture

Algorithm 1 Dyna-Q algorithm, as proposed by Sutton (1991).

Initialize Q(s, a), Model(s, a) ∀ s ∈ S, a ∈ A
repeat (for each episode)
    s ← current (non-terminal) state
    a ← ε-greedy(s, Q)
    execute a; observe s' and r
    Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
    Model(s, a) ← s', r
    for i = 1 to N do
        s ← random previously observed state
        a ← random action previously taken in s
        s', r ← Model(s, a)
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
    end for
until s' is terminal
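
The ε-greedy rule and the one-step Q-learning update described in this section can be sketched in Python as follows. This is only an illustrative sketch: the tabular `Q` stored in a `defaultdict`, the function names, and the discount factor γ = 0.95 are assumptions of this example (the text does not state a value for γ).

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, s, actions, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily (random tie-break)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q[(s, a)] for a in actions)
    return rng.choice([a for a in actions if Q[(s, a)] == best])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One-step update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


# Tiny usage example with made-up state and action names.
Q = defaultdict(float)                      # every Q(s, a) starts at zero
q_learning_update(Q, "s0", "right", -1.0, "s1", ["left", "right"])
```

Because every value starts at zero and rewards are negative in shortest-path tasks, actions that have not been tried yet look better than tried ones, so even the greedy branch explores at the beginning.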

2.3. The Dyna architecture 

Planning usually refers to any computational process
that takes a model as input and produces or improves a policy 
to interact with the modeled environment. Although there are 
different approaches, state-space planning is mainly a search 
through the state space for an optimal path. Actions cause tran- 
sitions from one state to another, and value functions are com- 
puted over states. 

In on-line planning, new information is gained from the in- 
teraction with the environment and may change the model. 
If decision-making and model-learning are both computation- 
intensive processes, it may be necessary to divide the available 
computational resources between them. Dyna (Sutton, 1991)
is a reinforcement learning architecture that easily integrates 
incremental reinforcement learning and on-line planning. 

The possible relationships between experience, model and
values for Dyna-Q are described in Figure 1. Each arrow shows
a relationship of influence. Note how experience can improve
the value function either directly or indirectly, through the
model. It is the latter, sometimes called indirect
reinforcement learning, that is involved in planning. In Algorithm 1,
where Dyna-Q is described, Model(s, a) denotes the
contents of the model (predicted next state and reward)
for the state-action pair (s, a). Direct reinforcement learning,
model-learning, and planning are implemented by steps 6,
7 and 8 of the listing, respectively. If steps 7 and 8 were omitted, the
remaining algorithm would be one-step tabular Q-learning.

Dyna-Q includes all of these processes (planning, acting,
model-learning, and direct RL), all occurring continually. The planning
method is random-sample one-step Q-planning. The direct
RL method is one-step Q-learning. The model-learning
method is table-based and assumes the world is deterministic.
After each transition, the model records the next state and reward
that will deterministically follow. Thus, if the model is queried with a
state-action pair that has been experienced before, it simply returns
the last-observed next state and next reward as its prediction.
During planning, the algorithm samples randomly only from
state-action pairs that have been previously experienced. Conceptually,
planning, acting, model-learning, and direct RL occur
simultaneously and in parallel in Dyna agents (Sutton and
Barto, 1998).
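
To make the interplay of direct RL, model learning and planning concrete, here is a minimal Python sketch of tabular Dyna-Q with a deterministic table model. The environment interface `step(s, a) -> (s', r)`, the six-cell corridor used to exercise it, the discount factor γ = 0.95 and all names are assumptions of this example, not part of the original algorithm description.

```python
import random
from collections import defaultdict


def dyna_q(step, start, goal, actions, episodes=30, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q; returns the Q table and the steps taken in each episode."""
    rng = random.Random(seed)
    Q = defaultdict(float)      # Q(s, a), initialised to zero
    model = {}                  # model[s][a] = (s', r): last observed outcome
    steps_per_episode = []

    def backup(s, a, r, s2):
        best_next = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s, steps = start, 0
        while s != goal:
            if rng.random() < epsilon:                 # epsilon-greedy selection
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s2, r = step(s, a)
            backup(s, a, r, s2)                        # direct RL
            model.setdefault(s, {})[a] = (s2, r)       # model learning
            for _ in range(n_planning):                # planning on simulated experience
                ps = rng.choice(list(model))           # previously observed state
                pa = rng.choice(list(model[ps]))       # action previously taken there
                ps2, pr = model[ps][pa]
                backup(ps, pa, pr, ps2)
            s, steps = s2, steps + 1
        steps_per_episode.append(steps)
    return Q, steps_per_episode


def corridor_step(s, a):
    """Hypothetical test world: cells 0..5 in a row, reward -1 per move."""
    return min(max(s + a, 0), 5), -1.0


Q, steps = dyna_q(corridor_step, start=0, goal=5, actions=[-1, 1])
```

Setting `n_planning=0` removes the planning loop and leaves plain one-step tabular Q-learning, which mirrors the remark about omitting the model-learning and planning steps.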

3. A heuristic planning reinforcement learning algorithm 
based on the Dyna architecture 

Here we propose a heuristic planning strategy to incorporate
into a Dyna agent the advantages of a particular heuristic, in
order to find the shortest paths in grid-like environments, e.g.
RPGs. A heuristic search method, as a search method after all,
can be defined in terms of traversing a search tree. However,
instead of generating all possible solution branches, a heuristic
method selects branches more likely to produce successful outcomes
than others. It is selective at each decision point. The
proposed method incorporates the ability of heuristic search,
e.g. A*, to focus on specific search subtrees in order to make the
search more efficient. At the same time, the method learns
online, as any other common reinforcement learning algorithm,
and does not require a complete model of the environment
before starting to search.

3.1. Sampling from the worst trajectories (the nightmares 
metaphor) 

Contrary to intuition, the proposed sampling strategy consists
of using a learned model of the environment and traveling
across it along the worst trajectories with respect to some
heuristic index (e.g. a priori knowledge of the domain), thus receiving
the worst rewards. However, this leads the algorithm
to find the solution faster than any other approach that looks
better a priori.

Sampling from "bad" trajectories using simulated experience
has a very interesting analogue in human behavior: nightmares.
This analogy suggests that such a strategy can be considered
an interesting candidate hypothesis about the role of
nightmares in human behavior, thus assigning a specific function
to this behavior: a tool used by our brain to reorganize
some goal-oriented behaviors, using the resting time to learn
based on imagination (simulated experience). Furthermore,
Figure 9 (in Section 5) shows different trajectories obtained using this
sampling strategy. As can be seen, these trajectories present
some discontinuities (abrupt jumps) and also pass through the
walls, i.e. they violate physical laws; things that are very common
in dreams.

The analogue, in this case, of the H function
could be associated with the so-called value-systems, which
shape human behavior (Edelman, 1987; Sporns et al., 2000). Indeed,
there is a growing body of research on value-systems
in robotics and autonomous agents, aimed at designing robots
with adaptive, lifelong learning behavior, because these value-systems
are a way for robots to behave autonomously through
spontaneous, self-generated activity (Merrick, 2010). In connection
with autonomous agents, many different kinds of value-systems,
based on some aspects of human behavior related to
motivation, e.g. curiosity driven, intrinsically motivated, novelty
detection, have been proposed. However, it seems that there
is (to the best of our knowledge) no publication along this line of
research relating the study of dreams and value-systems to the
Reinforcement Learning and Planning field.

3.2. The Dyna-H algorithm

In RPGs and grid-world-like environments in general, it is
common to use the Euclidean or city-block distance functions as
an effective heuristic. In this case study, the Euclidean distance
is used for the heuristic (H) planning module. However, in
general, H(s, a) represents a general function that gives a guess
about how bad it is to take action a in state s, e.g. the Euclidean
distance between the state s' and the goal position (1):

$$H(s, a) = \lVert s' - \mathit{goal} \rVert \tag{1}$$

where the state $s'$ is the result of the model query $s' = \mathit{Model}(s, a)$.

Hence, given the heuristic H, the heuristic action ha is defined as:

$$h_a(s, H) = \arg\max_{a} H(s, a), \tag{2}$$

where ha(s, H) is the worst action according to H, i.e. the action
that yields the greatest distance from the goal. Algorithm 2
describes the steps of this strategy.
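
A minimal Python sketch of the heuristic action and of one pass of heuristic planning is given below. It reflects one reading of the definitions above and is not the authors' implementation: states are assumed to be (row, column) tuples, the model is assumed to be a nested dictionary `model[s][a] = (s', r)`, the argmax is taken only over actions already stored in the model for the current state, and the planner falls back to a random previously observed state-action pair when no such action exists. The discount factor γ is a free parameter of the sketch.

```python
import math
import random
from collections import defaultdict


def heuristic_action(s, model, goal):
    """ha(s, H): the modelled action whose predicted s' lies farthest from goal.

    Returns None when no action has been modelled for s yet.
    """
    known = model.get(s)
    if not known:
        return None
    # H(s, a) is the Euclidean distance between s' = Model(s, a) and the goal.
    return max(known, key=lambda a: math.dist(known[a][0], goal))


def dyna_h_planning(Q, model, s, goal, actions, n_planning, alpha, gamma, rng):
    """Follow the worst modelled trajectory from s for n_planning backups.

    The model must already contain at least one transition.
    """
    for _ in range(n_planning):
        a = heuristic_action(s, model, goal)
        if a is None:                        # nothing modelled here: jump elsewhere
            s = rng.choice(list(model))
            a = rng.choice(list(model[s]))
        s2, r = model[s][a]
        best_next = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2                               # keep walking along the simulated path


# Usage example with a hand-built model (hypothetical values).
model = {(0, 0): {"right": ((0, 1), -1.0), "down": ((1, 0), -1.0)}}
Q = defaultdict(float)
dyna_h_planning(Q, model, (0, 0), (0, 3), ["right", "down"], 1, 0.1, 0.95,
                random.Random(0))
```

In the usage example the goal lies to the right, so "down" is the worse action and is the one that gets backed up. The random jump in the fallback branch is one source of the abrupt discontinuities that such simulated trajectories exhibit.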

4. Experimental scenario 

The Dyna-H heuristic planning algorithm has been evaluated
and compared, in terms of learning rate, to the one-step
Q-Learning and Dyna-Q algorithms on the same problem.

Figure 2: The experimental scenario, starting point (S), goal (G), obstacles
(gray), and a sample trajectory.

Algorithm 2 The proposed Dyna-H heuristic planning algorithm

1: Initialize Q(s, a), Model(s, a) ∀ s ∈ S, a ∈ A
2: repeat (for each episode)
3:     s ← current (non-terminal) state
4:     a ← ε-greedy(s, Q)
5:     execute a; observe s' and r
6:     Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
7:     Model(s, a) ← s', r
8:     for i = 1 to N do
9:         a ← ha(s, H)
10:        if (s, a) ∉ Model then
11:            s ← random previously observed state
12:            a ← random action previously taken in s
13:        end if
14:        s', r ← Model(s, a)
15:        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
16:        s ← s'
17:    end for
18: until s' is terminal

The experiment consists of searching for optimal paths, i.e., 
the shortest path with the lowest cost between two states. To 
study this problem in the context of reinforcement learning, we 
assume that it is a Markov decision process, where there is a 
set of possible states and a set of actions. A typical problem 
in path-finding is obstacle avoidance. The simplest approach 
to this problem is to ignore obstacles until bumping into them.
This approach is simpler because it makes few demands: all 
that it needs is the relative position of the entity and its goal, 
and whether the immediate vicinity is blocked. For many game 
situations, this is good enough. But there are scenarios where 
the only intelligent approach would be to plan the entire route 
in advance. 

In this paper, the playing space is represented with square
tiles as a 39 × 36 grid (Figure 2). The obstacles are walls that are
set randomly (in gray). The state is the tile or position where
the entity is located. Neighboring states would vary depending
on the game and the local situation. The cost of going from one
position to another can represent many things: in this case it
is computed as the simple distance between the two positions,
which in RL terminology is equivalent to setting r = -1 for all
non-terminal state transitions, thus minimizing the total distance,
i.e. finding an optimal path. The grid is represented as a
two-dimensional matrix of 39 rows and 36 columns. This matrix
establishes the communication between nodes or states; each
node can be related to up to four neighbors, depending on the type
of each node, i.e. up (↑), down (↓), left (←) and right (→).
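
The grid representation can be sketched in Python as below. Several details are assumptions of this sketch rather than statements of the paper: walls are encoded as 1 and free tiles as 0, a move into a wall or off the grid leaves the entity where it is, the start and goal tiles are forced to be free, and positions are treated as 0-based (row, column) pairs. The wall sampling follows the rule this section gives for the random labyrinths (a normal draw with zero mean and standard deviation 0.3 per tile, rounded, with any nonzero result marking an obstacle).

```python
import random

ROWS, COLS = 39, 36
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def make_labyrinth(rng, sigma=0.3, keep_free=()):
    """Sample one wall map: tile = sgn(|round(phi)|) with phi ~ N(0, sigma^2)."""
    grid = [[1 if abs(round(rng.gauss(0.0, sigma))) > 0 else 0
             for _ in range(COLS)]
            for _ in range(ROWS)]
    for r, c in keep_free:          # assumption: start and goal are never walls
        grid[r][c] = 0
    return grid


def step(grid, state, action):
    """Deterministic transition; every move costs -1 (assumed: blocked moves stay put)."""
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    if 0 <= r < ROWS and 0 <= c < COLS and grid[r][c] == 0:
        state = (r, c)
    return state, -1.0


# Usage example: one random labyrinth with the start and goal kept free.
labyrinth = make_labyrinth(random.Random(0), keep_free=[(1, 4), (28, 34)])
```

With a standard deviation of 0.3, a draw rounds to a nonzero value only when its magnitude exceeds 0.5, so walls stay comparatively sparse and a path between start and goal usually exists, although the sketch does not check for one.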




As in the Dyna maze (Sutton and Barto, 1998), all the tests were
based on the one-step Q-Learning algorithm with a set of fixed
parameters. The initial action values are zero, i.e. Q(s, a) = 0,
the step-size parameter is α = 0.1, and the exploration parameter
was fixed to ε = 0.1. When selecting greedily among
actions, ties were broken randomly. For each algorithm, the
learning curve shows the number of steps taken by the agent in
each episode, averaged over 30 runs, each run consisting of a
randomly generated labyrinth, except for the starting (1, 4) and
goal (28, 34) positions, which remained constant during all experiments.
Each random labyrinth was obtained using the same
probability distribution (normal with μ = 0, σ = 0.3) for every
square tile of the grid, as shown in (3) and (4):

$$\varphi(x) \sim N(\mu = 0, \sigma^2 = 0.3^2), \tag{3}$$

$$\mathit{tiletype} = \operatorname{sgn}(\lvert \operatorname{round}(\varphi(x)) \rvert), \tag{4}$$

where tiletype = 1 means that there is an obstacle and
tiletype = 0 indicates a free tile.

5. Experimental Results

Figures 3 to 10 show the results of the simulations. As explained
before, we have compared the performance of three
algorithms: one-step Q-Learning (Figures 3 and 6), Dyna-Q
(Figures 4 and 7) and the proposed heuristic planning Dyna-H
algorithm (Figures 5 and 8).

Figure 3: Average learning curve over 30 runs for the one-step tabular
Q-Learning algorithm

Figure 4: Average learning curve over 30 runs for the Dyna-Q model with
random sample with 10 planning steps
Figure 5: Average learning curve over 30 runs for the proposed Dyna-H heuristic
planning algorithm with 10 planning steps

Figure 6: Trajectory describing the best path found by the one-step Q-Learning
algorithm after 100 episodes for the first experiment.

Figure 7: Trajectory describing the best path found by the Dyna-Q planning
algorithm (10 planning steps) after 100 episodes for the first experiment.

Figure 8: Trajectory describing the best path found by the proposed Dyna-H
heuristic planning algorithm (10 planning steps) after 100 episodes for the first
experiment.

For each algorithm, the initial seed for the random
number generator was held constant; hence, all algorithms are evaluated
on the same set of 30 different grid configurations. For Dyna-Q
and Dyna-H, the number of planning steps was fixed to 10. All
experiments ran for up to 100 episodes.
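
The evaluation protocol above (the same seeded configurations for every algorithm, 30 runs, 100 episodes) can be sketched in Python as follows. The per-run seeding scheme `base_seed + k`, the `run_once(rng, n_episodes)` interface and the placeholder learner `dummy_run` are all hypothetical; they only illustrate how holding the seed constant makes the averaged curves of different algorithms comparable.

```python
import random


def average_learning_curve(run_once, n_runs=30, n_episodes=100, base_seed=0):
    """Average steps-per-episode over n_runs; run k is seeded with base_seed + k."""
    totals = [0.0] * n_episodes
    for k in range(n_runs):
        rng = random.Random(base_seed + k)   # identical seed sequence for every algorithm
        curve = run_once(rng, n_episodes)
        for i, steps in enumerate(curve):
            totals[i] += steps
    return [t / n_runs for t in totals]


def dummy_run(rng, n_episodes):
    """Placeholder learner: the step count shrinks with the episode index."""
    return [100 - i + rng.randint(0, 3) for i in range(n_episodes)]


curve_a = average_learning_curve(dummy_run)
curve_b = average_learning_curve(dummy_run)   # same seeds, hence the same curve
```

In an actual comparison, `run_once` would build the labyrinth from `rng` and then run one of the three learners on it, so that run k of every algorithm faces the same grid.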

Figure 3 shows the learning curve of the one-step Q-Learning
algorithm. As can be seen, this is the slowest method and
thus it serves as a baseline for comparisons. The Q-Learning
agent presents a very slow convergence curve and in fact it
never found the optimal policy. It started with 2000 steps and
showed a constant policy improvement during the 100 episodes,
ending with approximately 1400 time steps. Figure 6 shows the best
path found by the one-step Q-Learning algorithm.

Figure 4 shows the learning rate of the Dyna-Q algorithm.
As expected, the Dyna agent improved the learning curve with
respect to the one-step Q-Learning algorithm. The Dyna-Q agent
presents a "reasonable" convergence curve. However, it never
found the optimal policy. It started with 2000 steps and showed
a high policy improvement up to episode 40, where the agent
continued improving but at a slower, almost constant
(linear-like) rate during the remaining 60 episodes, ending
the learning with around 400 time steps. Although it presents
a good behavior, it could not find the optimal trajectory during
the simulation time. Next we present some examples of the
kind of solution trajectories generated by each algorithm. These
solutions correspond to the first experiment of each algorithm
evaluation. Figure 7 shows the best path found by the Dyna-Q
algorithm.

Figure 5 shows the behavior of the proposed heuristic planning
algorithm. As can be seen, the learning curve of the heuristic-planning
agent improved considerably in comparison
to the other algorithms. It presents an exponential convergence
until the optimal policy is found. It started with 1600 steps and
reduced them drastically up to episode 10, where it reaches the
optimum (80 steps per episode). This means a high improvement
both in the learning speed and in the quality of the policy
found. Figure 8 shows the best path found by the Dyna-H algorithm.
It can be seen that the generated path is very close to
the optimal path.

Figure 10: Comparison of the average learning curve over 30 runs for Q-Learning, Dyna-Q (random sample with 10 planning steps) and the proposed Dyna-H
heuristic planning algorithm (with 10 planning steps)

Figure 11: Average learning rates (over 30 runs) of the proposed Dyna-H heuristic planning algorithm for different numbers of planning steps, N = 1, 5, 10 and 25

Figure 9 shows several examples of the trajectories generated
by the heuristic sampling planning procedure. The trajectories
shown are all taken from the first episode of the first experiment
of the Dyna-H algorithm and represent successive time steps of
the episode. In these images, it can be clearly appreciated
that the trajectories generated by the heuristic sampling strategy
are almost the worst, or very bad, with respect to the solution (i.e.
they are sampled from the worst trajectories, as defined by the Dyna-H
algorithm), and that using these trajectories the algorithm learns
extremely well.

In Figure 10 the average learning curves of the three algorithms
are shown. The difference in terms of learning rate exhibited
by the Dyna-H algorithm is evident.

As Sutton and Barto (1998) comment, in the short term,
sampling according to, for instance, the on-policy distribution
helps to focus on states that are close descendants of the initial
state. On the other hand, in the long run, focusing on the
on-policy distribution may make the convergence worse, because
the most visited states already have their correct values. Sampling
them is useless, whereas sampling other states may actually
help. This may be the reason why the exhaustive, unfocused
approach works better in the long run, at least for small
problems. Although it may seem the same case, the proposed
planning process does exactly the contrary of what an
optimal policy (the policy to which the on-policy distribution
should converge) would do, focusing on apparently not very promising
branches. However, by sampling from the worst trajectories,
the learned policy converges quickly to the optimal one.



Figure 11 shows an analysis of the convergence of the proposed
Dyna-H algorithm for different numbers of planning steps N.
The proposed heuristic planning algorithm has been
tested for N = 1, 5, 10 and 25 planning steps. For N = 1,
the algorithm converges in a few episodes, around the 7th episode.
However, it converges to a local suboptimal solution of around 370
steps per episode. For N = 5, the algorithm also converges in
around 7 episodes, but to a suboptimal solution that
is significantly better than in the previous case, reaching an
average of 250 steps per episode. The cases N = 10 and
N = 25 show the same convergence pattern as the N = 5
case, but they reach better policies.

It is quite significant that the case N = 1 presents the
same convergence rate as much higher numbers of planning steps, although it
finds much worse policies. However, when dealing with problems
where the system should save computational resources, it can
achieve a good compromise between optimality and computational
time. The learning curves for N = 5 up to N = 25 are
identical, the only difference being the optimality of the policy
reached, that is, the length of the path from the initial node to
the goal. Again, this behavior is quite interesting, since it indicates
that the trade-off between optimality and computational
resources can be directly controlled by tuning the number of
planning steps.
6. Conclusions and further work

In this paper we have presented a novel reinforcement
learning-planning algorithm, Dyna-H, that integrates planning
and learning into an online algorithm based on the well-known
Dyna architecture. The proposed method uses a heuristic in
the planning module. It incorporates the ability of A* to focus
on specific search subtrees in order to make the search more
efficient by taking advantage of the heuristic. Besides, it is a
model-free strategy that can be applied to sequential decision
making problems under uncertainty.

A scenario to compare three learning algorithms,
Q-Learning, Dyna-Q and the proposed Dyna-H, has been designed.
The results (learning rate, convergence and policy
found) obtained by all these methods have been shown and
discussed. The new algorithm gives the best trajectories, and
the number of steps is reduced by more than 90% with
the Dyna-H strategy. From these results, we can conclude that
the proposed Dyna-H heuristic planning algorithm is an effective
strategy in path-finding problems and therefore for
Role-Playing Games.

Since the main difference between Dyna-Q and the proposed
Dyna-H method is the use of a heuristic that guides the planning
process when exploring the model, it makes sense to conclude
that, under some well-defined scenarios such as informed
search methods, random sampling can be improved significantly.

We expect the successful application of the proposed algo- 
rithm to many related problems. 

Further work should include the application of the proposed 
heuristic planning algorithm to different domains, for example, 
stochastic environments such as capture games for chaotic mov- 
ing targets. 

Software 

An open-source Matlab implementation of the Dyna-H
algorithm can be obtained from the following address:
http://www.dacya.ucm.es/jam/downloads/Dyna-H.rar